In this project, we will analyze the White Wine data and try to understand which variables are responsible for the quality of the wine. First we will get a feel for the variables on their own, and then we will look for correlations between them and wine quality, with other factors thrown in.
Cheers
We have the following variables:
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
We have 12 variables. So what are their types?
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
All of them are numeric (quality is stored as an integer). There are 4898 data points.
Let’s get a first numerical overview of the data.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
The summary shows maxima that are really far out for residual.sugar, chlorides, free.sulfur… Those might be outliers, reporting problems or really special wines.
Let’s plot the distribution of each variable, as I would like to get a feel for the variables first. The shape of each distribution (normal, positively skewed or negatively skewed) and the number of outliers will also give some sense of what to expect when I plot different variables against each other.
First of all, what about the quality of those white wines? Quality is stored as a number, so we will also add it as a factor.
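Turning the score into a factor might look like the sketch below. It uses a tiny made-up `wine` data frame in place of the real data, and the column name `quality.factor` is my own choice for illustration.

```r
# Toy stand-in for the wine data; only the quality column matters here.
wine <- data.frame(quality = c(6L, 5L, 7L, 6L, 3L, 9L, 6L, 5L))

# Keep the integer column for correlations and add an ordered factor
# that can be used for grouping, boxplots and facets.
wine$quality.factor <- factor(wine$quality, levels = 3:9, ordered = TRUE)

# Counts per score; levels with no wines still appear with a count of 0.
quality_counts <- table(wine$quality.factor)
print(quality_counts)
```

Keeping both columns is a deliberate choice: the integer version is needed by `cor()`, while the factor gives one group per score.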
It looks like a normal distribution. Most of the wines have a quality of around 5, 6 or 7. As the good quality and the poor quality wines are almost like outliers here, it might be difficult to get an accurate model of the wine quality. Let’s look at the other plots.
Let’s check the alcohol spread.
Much more disparate, but we have a nice peak around 9.5% by volume. All in all it still looks roughly like a normal distribution, with most wines sitting towards the left (low alcohol) side.
Residual sugar is the amount of sugar remaining after fermentation.
The x scale is log10. We have the same type of distribution but with a long tail. The peak is around 1.5 g / dm^3 and the data seems to go all the way up to 25 g / dm^3. Let’s look at the summary.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The max is at 65.8 (which is kind of awkward) while the 3rd quartile is at 9.9 and the median at 5.2. So either those are quite special wines or there is some error in the data.
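To see why a log10 scale helps with this kind of long tail, here is a small sketch. The data is simulated (a log-normal sample standing in for residual.sugar, with parameters picked by me), so only the pattern matters, not the numbers.

```r
set.seed(42)

# Simulated right-skewed stand-in for residual.sugar (g / dm^3).
residual.sugar <- rlnorm(1000, meanlog = log(5), sdlog = 0.8)

# On the raw scale the mean sits well above the median: a long right tail.
raw_gap <- mean(residual.sugar) - median(residual.sugar)

# After a log10 transform, mean and median nearly coincide.
log_sugar <- log10(residual.sugar)
log_gap <- mean(log_sugar) - median(log_sugar)

# Histogram computed on the log10 scale (plot = FALSE only returns the bins).
h <- hist(log_sugar, breaks = 30, plot = FALSE)
print(c(raw_gap = raw_gap, log_gap = log_gap))
```

Calling `plot(h)` would draw the histogram; comparing mean and median is just a quick numeric check of the skew.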
Chlorides are salt, and I would expect them not to taste great in wine.
We have a normal distribution once the x axis is transformed using a log10 function. Like residual.sugar, the data has a long right tail.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The unit is the same as for residual.sugar but the numbers are way lower. The median is at 0.043 g / dm^3. The max (0.346) is again far higher, about 8 times the median, which is not normal.
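How far out those maxima are can be put into numbers. The sketch below reuses the quartiles printed in the summaries for residual.sugar and chlorides. The 1.5 × IQR upper fence is a common rule of thumb that I am adding for illustration; it is not part of the original analysis.

```r
# Values copied from the summaries printed above.
stats <- data.frame(
  variable = c("residual.sugar", "chlorides"),
  q1       = c(1.700, 0.036),
  median   = c(5.200, 0.043),
  q3       = c(9.900, 0.050),
  max      = c(65.800, 0.346)
)

# How many times larger than the median is the maximum?
stats$max_over_median <- stats$max / stats$median

# Rule-of-thumb upper fence: Q3 + 1.5 * IQR (an assumption, see lead-in).
stats$upper_fence <- stats$q3 + 1.5 * (stats$q3 - stats$q1)
stats$max_beyond_fence <- stats$max > stats$upper_fence

print(stats)
```

Both maxima land well beyond the fence, which fits the impression that they are either special wines or recording problems.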
In small quantities, citric acid can add freshness and pop to wine.
Looking at citric.acid with a log10 scale, we again have a normal distribution, of course with a few noticeable outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Once again we have a median at 0.32 g / dm^3 whereas the max is at 1.66 and the min is at 0.
The pH scale goes from 0 (very acidic) to 14 (very basic). Most wines are between 3 and 3.5.
The pH is normally distributed, with a clear peak and no need for a log scale. We can see that the data is quite dispersed but that, as the description says, most of the data points are between 3 and 3.5.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The min and max are once again quite far from the 1st and 3rd quartiles respectively, but there is not much variation overall.
Volatile acidity is the amount of acetic acid in wine. At high concentration it gives an unpleasant vinegar taste which, I think, is what a low quality wine tastes like.
Still a long tail in the data with a peak around 0.3 g / dm^3 of acetic acid.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The max is really out of range of the other data points. The median is at 0.26 g / dm^3 of acetic acid and the mean is quite similar at 0.2782 g / dm^3.
Total sulfur dioxide is the sum of the free form (which protects the wine against oxidation) and the bound forms of sulfur dioxide. At high concentration it can influence taste.
The x axis is transformed using log10. We have several bumps but overall a normal distribution. Some data points seem to be isolated. Let’s look at the variable summary.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
The max is really too high compared to the median and the 3rd quartile. I think it is an error in data reporting/recording.
My main line of inquiry will be the relation of each variable to quality.
Does the alcohol level affect the quality?
## [1] 0.4355747
There is a medium correlation (around 0.44) between quality and alcohol. The graph shows that the higher the quality, the higher the alcohol. This is especially true for the higher end wines.
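A coefficient like the one above can be computed with `cor()`. The sketch below uses a small made-up data frame, so its numbers will not match the 0.44 reported here. The Spearman variant is my own addition, on the assumption that a rank-based measure is a reasonable cross-check for an ordinal score like quality.

```r
# Made-up values for illustration only; not the real wine data.
wine <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 11.0, 12.2, 12.8, 10.4, 9.2, 11.6),
  quality = c(5L, 5L, 6L, 5L, 6L, 7L, 8L, 6L, 4L, 7L)
)

# Pearson correlation (the default method of cor()).
r <- cor(wine$alcohol, wine$quality)

# Rank-based alternative, less sensitive to the discrete quality scale.
rho <- cor(wine$alcohol, wine$quality, method = "spearman")

# cor.test() adds a confidence interval and p-value to the Pearson estimate.
ct <- cor.test(wine$alcohol, wine$quality)

print(c(pearson = r, spearman = rho))
print(ct$conf.int)
```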
We have 3 types of acidity listed: fixed acidity, volatile acidity and citric acid.
They should all relate to the pH, I think.
## [1] -0.4258583
## [1] -0.03191537
## [1] -0.1637482
So fixed acidity correlates moderately with the pH (around -0.43) and citric acid only weakly (around -0.16), while volatile acidity shows practically no correlation (around -0.03).
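The three coefficients can be obtained in one call by applying `cor()` to each acidity column. This is a sketch on simulated data in which, by construction, only fixed acidity drives the pH; the column names follow the dataset, everything else is an assumption.

```r
set.seed(1)
n <- 200

# Simulated stand-in: only fixed.acidity influences pH here.
wine <- data.frame(
  fixed.acidity    = rnorm(n, mean = 6.9, sd = 0.8),
  volatile.acidity = rnorm(n, mean = 0.28, sd = 0.10),
  citric.acid      = rnorm(n, mean = 0.33, sd = 0.12)
)
wine$pH <- 3.19 - 0.08 * (wine$fixed.acidity - 6.9) + rnorm(n, sd = 0.1)

# Correlate each acidity column with pH; the result is a named vector.
acids <- c("fixed.acidity", "volatile.acidity", "citric.acid")
pH_cor <- sapply(wine[acids], cor, y = wine$pH)

print(round(pH_cor, 2))
```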
## `geom_smooth()` using method = 'gam'
Volatile acidity and pH are not related. Even for high quality wines, the pH seems to vary greatly. This confirms the low correlation previously found.
Let’s take a closer look at pH and quality now.
[Plot: White Wine pH by quality]
The higher the quality, the higher the pH it seems, but not very clear.
## [1] 0.09942725
The correlation coefficient between the two is very low at about 0.10. So the first impression from the plot is not really backed up by the number.
What about salt (chlorides)?
As a reminder, the chlorides summary.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The summary shows that the max is at 0.346 whereas the 3rd quartile is at 0.05. Let’s zoom in to get a better view.
The plot is limited to the range between the 0.05 and 0.95 quantiles. There is some overlap, but the higher the quality, the lower the chlorides. (This somewhat supports our earlier hypothesis that the presence of salt hurts the taste of wine.)
## [1] -0.2099344
The correlation is low at -0.21, so the relationship between the two variables is not very strong.
The last of the variables that obviously affect taste (and so quality) is, I think, sugar.
No clear relationship between sugar and quality. The scatterplot shows points all over the place. As for the boxplot, the median seems to be higher for mid-range qualities, but there is nothing special to notice here.
## [1] -0.09757683
The correlation confirms that the relationship is very weak (about -0.10), almost negligible.
A closer look at chlorides.
## `geom_smooth()` using method = 'gam'
The overall trend is downward: less chlorides as the alcohol level gets higher. But we can see that for the higher qualities (6, 7, 8), there is a major concentration of points around 10 to 12, where the salt level rises a bit. It might be due to outliers, as we only have a few points for those qualities.
## [1] -0.2231098
## [1] -0.3199424
## [1] -0.5545504
## [1] -0.5124824
The correlations between chlorides and alcohol for qualities 5, 6, 7 and 8 are -0.22, -0.32, -0.55 and -0.51 respectively.
Correlations are weak for quality levels 5 and 6, but quite noticeable for levels 7 and 8.
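One way to get a correlation per quality level is to split the data by quality and apply `cor()` to each piece. The sketch below runs on simulated data (group sizes and coefficients are invented), so it illustrates the mechanics rather than reproducing the values above.

```r
set.seed(3)

# Simulated stand-in with deliberately few wines at quality 8.
quality <- rep(5:8, times = c(40, 40, 20, 6))
alcohol <- rnorm(length(quality), mean = 9.5 + 0.5 * (quality - 5), sd = 1)
chlorides <- 0.06 - 0.002 * alcohol + rnorm(length(quality), sd = 0.005)
wine <- data.frame(quality, alcohol, chlorides)

# One correlation per quality level.
cor_by_quality <- sapply(
  split(wine, wine$quality),
  function(g) cor(g$chlorides, g$alcohol)
)

# Group sizes, to judge how much each coefficient can be trusted.
n_by_quality <- table(wine$quality)

print(round(cor_by_quality, 2))
print(n_by_quality)
```

Printing the group sizes next to the coefficients makes it easier to spot levels where a coefficient rests on only a handful of wines.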
Let’s look at the same thing with sugar:
## `geom_smooth()` using method = 'gam'
Overall there is a downward trend to notice here. Let’s confirm this with correlations.
## [1] -0.4414825
## [1] -0.4549961
## [1] -0.4809369
## [1] -0.5220108
The correlations are not strong: -0.44, -0.45, -0.48 and -0.52 for qualities 5, 6, 7 and 8 respectively.
The correlations are fairly consistent across qualities, with quality 8 on the higher side, but we only have a few data points for this quality so the difference might be due to that.
The trends in the plots for chlorides and residual.sugar look kind of the same, and both correlate negatively with alcohol. Chlorides and residual sugar might be linked.
## `geom_smooth()` using method = 'gam'
The relation between chlorides and residual sugar seems to get more wiggly (non-linear) as the quality improves. It might make white wine taste unpleasant and unpredictable in the end.
Let’s see if the ratio of sugar to chlorides is any help:
No chance here. I was hoping for some clusters of points for each quality level, but they can vary widely.
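For reference, the ratio is a simple derived column. This sketch builds it from the first four rows shown in the `str()` output near the top; the column name `residual.sugar_chlorides` is the one used later in this analysis.

```r
# First four wines, values taken from the str() output above.
wine <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5),
  chlorides      = c(0.045, 0.049, 0.050, 0.058)
)

# Sugar-to-chlorides ratio as a new column.
wine$residual.sugar_chlorides <- wine$residual.sugar / wine$chlorides

print(wine)
```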
## [1] 0.02252779
## [1] 0.03512032
## [1] 0.2755308
## [1] 0.3148669
The correlations for qualities 5, 6 and 7 are quite low (0.02, 0.03 and 0.28 respectively) while the one for 8 is the highest (0.31). This might be because we have fewer points for quality 8 than for the other quality levels. Furthermore, we have very far-off values in both residual.sugar and chlorides, so let’s calculate the same values on a subset limited to the range between the .05 and .95 quantiles.
## [1] 0.1124103
## [1] 0.1979728
## [1] 0.3488694
## [1] 0.417167
Correlations are slightly improved by removing the extreme quantiles: 0.11, 0.20, 0.35 and 0.42 for qualities 5, 6, 7 and 8 respectively.
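Limiting the data to the range between the .05 and .95 quantiles can be done with a small helper. The function name `trim_quantiles` is hypothetical and the data is simulated, with one extreme wine appended that mimics the maxima seen in the summaries.

```r
# Hypothetical helper: keep rows whose `column` value lies between the
# lower and upper quantiles (inclusive).
trim_quantiles <- function(df, column, lower = 0.05, upper = 0.95) {
  limits <- quantile(df[[column]], probs = c(lower, upper), na.rm = TRUE)
  keep <- df[[column]] >= limits[[1]] & df[[column]] <= limits[[2]]
  df[keep, , drop = FALSE]
}

set.seed(7)

# Simulated stand-in: 99 ordinary wines plus one extreme one.
wine <- data.frame(
  chlorides      = c(rlnorm(99, meanlog = log(0.043), sdlog = 0.25), 0.346),
  residual.sugar = c(rlnorm(99, meanlog = log(5.2), sdlog = 0.7), 65.8)
)

# Trim on both variables in turn, then compare correlations.
trimmed <- trim_quantiles(wine, "chlorides")
trimmed <- trim_quantiles(trimmed, "residual.sugar")

cor_all <- cor(wine$residual.sugar, wine$chlorides)
cor_trimmed <- cor(trimmed$residual.sugar, trimmed$chlorides)
print(c(all = cor_all, trimmed = cor_trimmed, rows_kept = nrow(trimmed)))
```

Note that trimming the second variable after the first recomputes the quantiles on the already reduced data; computing both sets of limits on the full data is an equally defensible choice.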
I think this relationship between residual sugar and chlorides is worth investigating to see if other variables might come into play. So let’s map chlorides against sugar, add alcohol as a color, and use this as a base for other graphs.
We can see as before that the lower the chloride content, the higher the alcohol. Furthermore, wines with low alcohol levels seem to have less sugar and a high chloride content.
## [1] -0.2996015
## [1] -0.3502557
## [1] -0.3309418
## [1] -0.2957432
## [1] -0.2876385
The correlation between residual.sugar_chlorides and alcohol for the subset is -0.30, which is low. Breaking it up by quality, we have -0.35, -0.33, -0.30 and -0.29 for qualities 5, 6, 7 and 8 respectively.
So the correlation does not vary greatly between qualities. We do not have a specific relationship for any particular quality.
Let’s go back to our acidity variables:
These box plots don’t help much in the exploration here. Let’s check correlations.
## [1] -0.1136628
## [1] -0.194723
## [1] -0.009209091
The correlations show that citric acid correlates very weakly with quality (-0.009). On the other hand, volatile acidity (-0.19) and fixed acidity (-0.11) show some relationship, especially volatile acidity. This kind of acidity correlates negatively with quality, meaning that as the quality improves, the volatile acidity decreases. The vinegar taste brought by volatile acidity seems to hurt the quality.
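When box plots are hard to read, a table of medians per quality level can help show a trend like this one. The sketch below uses a few made-up wines, so the medians are illustrative only.

```r
# Made-up values for illustration; not the real wine data.
wine <- data.frame(
  quality          = c(4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8),
  volatile.acidity = c(0.45, 0.39, 0.32, 0.30, 0.35,
                       0.26, 0.25, 0.28, 0.25, 0.27, 0.26)
)

# Median volatile acidity for each quality level.
medians <- tapply(wine$volatile.acidity, wine$quality, median)

print(medians)
```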
So we now have alcohol, volatile acidity and, to a lesser extent, fixed acidity correlating with quality. I will get back to the previous faceted chart of residual sugar and chlorides and add citric acid/volatile acidity to see if I can get some more information.
## [1] 0.08184145
We can see that as the volatile acidity level increases, the quality seems to go down a bit (-0.19), showing a slight negative correlation.
Citric acid and white wine quality are barely related. The trend seems to flatten out across all levels.
## [1] 0.09948825
## [1] 0.2050855
## [1] 0.02915079
## [1] 0.0073993
## [1] 0.0828936
Correlations show that there is in fact nothing. The correlation for our subset is 0.10. Breaking it down by quality gives 0.21, 0.03, 0.01 and 0.08 for qualities 5, 6, 7 and 8 respectively.
Density is related to alcohol, so let’s see if we can find something here.
There is some trend going on: better quality wines have higher residual.sugar and higher densities.
## [1] 0.672887
## [1] -0.2996015
## [1] 0.7901922
## [1] -0.3502557
Density relates more strongly to the residual.sugar_chlorides ratio than alcohol does: 0.67 vs -0.30. We see the same phenomenon for quality 5: 0.79 for density vs -0.35 for alcohol.
It seems that the more residual sugar and chlorides in the wine, the more sulphates we get.
Density is related to alcohol, so let’s see if we can find something here.
We get the same kind of graph, with less dense wines on the left side whereas denser wines are in the right region, across all quality levels.
## [1] 0.672887
## [1] -0.2996015
## [1] 0.6908107
## [1] -0.3309418
Density relates more strongly to the residual.sugar_chlorides ratio than alcohol does: 0.67 vs -0.30. We see the same phenomenon for quality 6 (0.69 for density vs -0.33 for alcohol).
There seems to be no clear relation between sulphates and quality, nor between sulphates and chlorides or residual.sugar.
## [1] -0.0753957
## [1] 0.02227755
## [1] -0.07837732
## [1] -0.1402696
## [1] -0.2718309
The correlation confirms this impression, with a correlation of -0.08 for the subset between residual.sugar_chlorides and sulphates. Breaking it down by quality gives 0.02, -0.08, -0.14 and -0.27 for qualities 5, 6, 7 and 8 respectively.
Correlations by quality are low and increase in magnitude for quality 8, but this quality has only a few data points.
No strong correlation here.
## [1] 0.0521672
The correlation confirms this (0.05): there is no strong relation between quality and sulphates.
Let’s now look at the sulfur dioxide variables.
Both plots show sulfur levels across all qualities and all levels of residual.sugar and chlorides. There is a slight positive relationship between those variables.
## [1] 0.2728557
## [1] 0.3028951
The correlations are indeed slightly positive (0.27 and 0.30).
First of all, there are outliers in the dataset. I tried to remove them when their values seemed to be really out of range, but they might also reflect the diversity of wines. As I am not sure, the following graphs are based on the whole dataset. All in all, there are no really strong correlations between quality and the other variables presented here. But I could understand some relationships regarding taste:
volatile.acidity = Vinegar taste, unpleasant actually
The densities were scaled so that the low counts of the extreme qualities do not affect the overall distribution. As the quality of the wines gets higher, the volatile acidity gets lower. The correlation is not so strong at -0.19. The negative sign signifies that the higher the volatile acidity (and hence the vinegar taste), the lower the quality, although this is not always true. This is kind of what we can expect: cheap wine (which is most of the time, though not always, of lower quality) often has a pungent smell and taste, like vinegar.
citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
Citric acid does not play a significant role in enhancing quality here.
This graph shows smoothers of alcohol by sulphates for each quality. Quality and alcohol have the strongest correlation found in this analysis (around 0.44).
The low quality (3) wines vary a lot but the overall trend is that they are low in alcohol and the alcohol level decreases as the sulphates level increases. Qualities 4 to 7 are flat overall and the only differences come from the alcohol level. Quality 7 and 8 wines are peculiar in that they are grouped in the top left corner. The relationship is not flat like the other qualities (except 3), and those wines have high alcohol and less sulphates.
chlorides: the amount of salt in the wine
In the above analysis I tried to document my problems and solutions. I’ll repeat three of the above plots and give my reasons why they are essential.
I include this plot because it was our first strong evidence about the role of alcohol. We can see as before that the lower the chloride content, the higher the alcohol. Furthermore, wines with low alcohol levels seem to have less sugar and a high chloride content.
This plot spurred my interest in exploring the relationship of chlorides and acids.
Citric acid and white wine quality are barely related. In the later part of the analysis we confirmed this with a boxplot of citric acid: it does not play a significant role in enhancing white wine quality.
The overall downward trend here tells us that less salt in wine goes with better quality, although the correlation itself is weak (-0.21).
In this project my main focus was to do only exploratory analysis.
I did the same exercise for the rest of the variables. I was looking for some more advanced relations between the variables but was not able to find them. All in all, I get several variables correlating with quality:
1. Alcohol
2. Volatile acidity
3. Citric acid
4. Sulphates
The limits of the dataset are really the lack of points for the lower and higher quality wines. Furthermore, the source of the quality score is unknown: is it from a professional? A store? And we have to keep in mind that taste is a really cultural thing (see the Indian food analysis), and a good wine for someone might not be one for another.
Last but not least, we do not have the age of the wine. In France, one of the first things we check for a wine is its age, because common knowledge dictates that an older wine is better. Another thing that I would like is to get the names of the wines so that I can get their prices and see how they relate to quality.